Uncovering Patterns and Anomalies in Manufacturing Data

INFO 523 - Final Project

Project description
Author
Affiliation

Cesar Castro

College of Information Science, University of Arizona

Abstract

In recent decades, industry has continued to modernize its processes and equipment, generating enormous amounts of data that in many cases go underutilized. Vast and rich data from machines, such as temperatures, vibration, and pressure, are constantly monitored. Traditional statistical process control is widely used to oversee key process parameters and alert technicians when something is behaving abnormally. In recent years, with the explosion of AI, industry has been exploring different approaches to monitor these parameters and more effectively predict or detect product defects and process problems.

This study focuses on using machine learning algorithms such as random forest and gradient boosting to predict failures and classify the failure type. Real factory data is noisy, involves interactions among many variables, and is highly imbalanced: machines are expected to run continuously without failure, and processes ideally produce products without defects, so the data is heavily skewed toward the good state. Tuning models to handle this imbalance is critical. Different sampling methods were evaluated: over-sampling the minority (failure) class with synthetic data, under-sampling the majority class, and assigning class weights.

A second study was done on time-series data, where algorithms like ARIMA and LSTM were used to detect outliers over time. One big challenge is that manufacturing KPIs are typically centered around a target, with random variation that ideally follows a normal distribution but no particular pattern, which limits the ability of these algorithms to predict future values. Results from these approaches can still be used to detect outliers, but they might not be the best choice.

Presentation: Panopto 🎥

Background

To explore machine learning algorithms for detecting failures, a dataset from Kaggle was used. The “Machine Predictive Maintenance Classification” data is a synthetic dataset that, according to the source, reflects a real use case in industry. The dataset consists of 10,000 rows with 14 different features.

Key Features: (Source: https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification)

  • UID: unique identifier ranging from 1 to 10000

  • productID: consisting of a letter L, M, or H for low (50% of all products), medium (30%), and high (20%) as product quality variants and a variant-specific serial number

  • Type: a column containing only the letter L, M, or H from productID.

  • air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K

  • process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.

  • rotational speed [rpm]: calculated from a power of 2860 W, overlaid with normally distributed noise

  • torque [Nm]: torque values are normally distributed around 40 Nm with an σ = 10 Nm and no negative values.

  • tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.

  • Machine failure: label that indicates whether the machine failed at this data point; it is set when any of the following failure modes is true

UDI Product ID Type Air temperature [K] Process temperature [K] Rotational speed [rpm] Torque [Nm] Tool wear [min] Target Failure Type
0 1 M14860 M 298.1 308.6 1551 42.8 0 0 No Failure
1 2 L47181 L 298.2 308.7 1408 46.3 3 0 No Failure
2 3 L47182 L 298.1 308.5 1498 49.4 5 0 No Failure
3 4 L47183 L 298.2 308.6 1433 39.5 7 0 No Failure
4 5 L47184 L 298.2 308.7 1408 40.0 9 0 No Failure

Table 1: Example of 5 rows of the synthetic data used for predictive modeling.

Figure 1: Distribution of one synthetic parameter, Process temperature [K] (mean 310.01, median 310.10, min 305.70, max 313.80, standard deviation 1.48, n = 10,000). Visually, the data does not show significant skew and is relatively close to a normal distribution.

Figure 2: Distribution of another synthetic parameter, Rotational speed [rpm] (mean 1538.78, median 1503.00, min 1168, max 2886, standard deviation 179.28, n = 10,000). In this case the data is slightly skewed, which is often expected for real machine data.

Table 1 and Figure 1 show what the data looks like. There is no missing data in this dataset, as it was synthetically created, which is unusual in a real scenario. Figure 2 shows another parameter where the data is skewed. After reviewing all parameters, the dataset is a good representation for exploring machine learning algorithms and can support conclusions from the analysis.

Data was standardized using scikit-learn's StandardScaler, and categorical features were encoded before moving to the training step.
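As a minimal sketch of this preprocessing step (the column names mirror Table 1, but the values and the exact encoder choice are illustrative assumptions, not the project's actual pipeline):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy frame mirroring a few of the dataset's columns (values are illustrative)
df = pd.DataFrame({
    "Type": ["M", "L", "L", "H"],
    "Air temperature [K]": [298.1, 298.2, 298.1, 298.3],
    "Torque [Nm]": [42.8, 46.3, 49.4, 39.5],
})

numeric_cols = ["Air temperature [K]", "Torque [Nm]"]
categorical_cols = ["Type"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),  # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # L/M/H -> dummy columns
])

X = preprocess.fit_transform(df)
print(X.shape)  # 2 scaled numeric columns + 3 one-hot columns -> (4, 5)
```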

All detailed data cleaning and exploration can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Data_Exploration_PDM.ipynb

The second dataset consists of simulated real-time sensor data from industrial machines. The source is also Kaggle, and it can be found here: https://www.kaggle.com/datasets/ziya07/intelligent-manufacturing-dataset/data

Key Features:

  • Industrial IoT Sensor Data

    • Temperature_C, Vibration_Hz, Power_Consumption_kW,
  • Network Performance:

    • Network_Latency_ms, Packet_Loss_%, Quality_Control_Defect_Rate_%
  • Production Indicators:

    • Production_Speed_units_per_hr, Predictive_Maintenance_Score, Error_Rate_%
  • Target Column Efficiency_Status

Timestamp Machine_ID Operation_Mode Temperature_C Vibration_Hz Power_Consumption_kW Network_Latency_ms Packet_Loss_% Quality_Control_Defect_Rate_% Production_Speed_units_per_hr Predictive_Maintenance_Score Error_Rate_% Efficiency_Status
0 2024-01-01 00:00:00 39 Idle 74.137590 3.500595 8.612162 10.650542 0.207764 7.751261 477.657391 0.344650 14.965470 Low
1 2024-01-01 00:01:00 29 Active 84.264558 3.355928 2.268559 29.111810 2.228464 4.989172 398.174747 0.769848 7.678270 Low
2 2024-01-01 00:02:00 15 Active 44.280102 2.079766 6.144105 18.357292 1.639416 0.456816 108.074959 0.987086 8.198391 Low
3 2024-01-01 00:03:00 43 Active 40.568502 0.298238 4.067825 29.153629 1.161021 4.582974 329.579410 0.983390 2.740847 Medium
4 2024-01-01 00:04:00 8 Idle 75.063817 0.345810 6.225737 34.029191 4.796520 2.287716 159.113525 0.573117 12.100686 Low

Table 2: Example of 5 rows of the synthetic data that will be used for time series analysis.

Figure 3: Distribution of Power_Consumption_kW in the second dataset (mean 5.75, median 5.76, min 1.50, max 10.00, standard deviation 2.45, n = 100,000). The data does not follow a specific distribution; it appears to be randomly generated over a fixed range.

As can be observed in Figure 3, the data in the second dataset seems randomly generated without following a specific distribution. Real machine data typically follows some type of distribution and is not completely random; commonly, a process has a target value (or values over time) with some natural variation around it. The raw data from this source is therefore not usable as-is for the purpose of this study. To make the data more like a real scenario, a mean was calculated every 12 hours for each machine.
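The 12-hour aggregation described above can be sketched with pandas as below; the column names follow Table 2, but the generated readings are an illustrative assumption (uniform noise, like the source) rather than the actual Kaggle data:

```python
import numpy as np
import pandas as pd

# Illustrative raw stream: 1-minute readings for two machines over two days
rng = np.random.default_rng(0)
n = 2 * 24 * 60  # two days of minute-level samples per machine
raw = pd.DataFrame({
    "Timestamp": pd.date_range("2024-01-01", periods=n, freq="min").tolist() * 2,
    "Machine_ID": [1] * n + [2] * n,
    "Power_Consumption_kW": rng.uniform(1.5, 10.0, 2 * n),  # random over a range, like the source
})

# Mean every 12 hours, computed separately for each machine
smoothed = (raw.set_index("Timestamp")
               .groupby("Machine_ID")["Power_Consumption_kW"]
               .resample("12h")
               .mean()
               .reset_index())
print(len(smoothed))  # 2 machines x 4 twelve-hour windows = 8 rows
```

Averaging also shrinks the spread, which matches the drop in standard deviation seen between Figure 3 and Figure 4.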

Results from the data transformation can be observed in figures 4 and 5.

Figure 4: Distribution of Power_Consumption_kW after the transformation (mean 5.74, median 5.75, min 2.96, max 8.29, standard deviation 0.67, n = 6,950).

Figure 5: Example trend for Power_Consumption_kW for one of the machines.

All detailed data cleaning and exploration for the second dataset can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Data_Exploration_MFG6G.ipynb

Model Training

The first objective of this study is to build a classification model for failures. The model will analyze data from Table 1 to accurately predict the specific failures and failure types.

The initial model focuses on predicting the Target feature as a binary classification, essentially pass or fail based on the dataset. The data was split 70% for training and 30% for testing, using stratification to maintain the distribution of 0s and 1s in each set, as the data is highly imbalanced.
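The stratified 70/30 split can be sketched as follows; the labels here are a synthetic stand-in with a failure rate similar in spirit to the dataset's imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative labels: ~3% failures, mimicking a highly imbalanced target
y = np.array([0] * 970 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Stratification keeps the failure rate (~3%) equal in both splits
print(y_train.mean(), y_test.mean())
```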

ROC-AUC Results

name ROC_AUC
0 Nearest_Neighbors 0.784725
1 Gradient_Boosting 0.821026
2 Decision_Tree 0.932831
3 Extra_Trees 0.909882
4 Random_Forest 0.972481
5 Neural_Net 0.918822
6 AdaBoost 0.899753
7 Naive_Bayes 0.827014
8 QDA 0.857405
9 LogisticRegression 0.880628

Table 4. ROC AUC results for multiple models evaluated

F1 Score Results

name f1 score
0 Nearest_Neighbors 0.423077
1 Gradient_Boosting 0.513966
2 Decision_Tree 0.503145
3 Extra_Trees 0.347826
4 Random_Forest 0.567742
5 Neural_Net 0.227642
6 AdaBoost 0.473373
7 Naive_Bayes 0.198758
8 QDA 0.386740
9 LogisticRegression 0.224000

Table 5. F1 Score for multiple models evaluated

This initial exploration of multiple options resulted in very high ROC-AUC scores for most models but much weaker F1 scores. Two observations follow from these results: first, the models might be over-fitting the data, inflating the scores; second, because the data is highly imbalanced, the models are very good at predicting 0s (good parts), as they represent the vast majority. The F1 scores show that precision and recall for predicting the 1s are not great, and predicting the 1s (failures) is the main purpose in a real industry use case.

Based on these initial results, two models were further evaluated: Random Forest Classifier and XGBoost, fine-tuning hyper-parameters and applying multiple sampling methods to reduce or manage the imbalance in the data.

For hyper-parameter tuning, a combination of GridSearchCV and RandomizedSearchCV from sklearn.model_selection was used to search over multiple options.
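A minimal sketch of this search is shown below; the parameter grid covers the same hyper-parameters as the final model, but the specific candidate values, dataset, and number of iterations are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced stand-in for the real training data
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,       # sample 10 combinations instead of the full grid
    scoring="f1",    # optimize for the minority class, per the discussion above
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```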

Final hyper-parameters for Random Forest Classifier Model:

   model = RandomForestClassifier(n_estimators=50,
                                   max_depth=10,
                                   random_state=42,
                                   max_features='log2',
                                   min_samples_leaf=5,
                                   min_samples_split=5)

Results for Random Forest Classifier Model:

RandomForestClassifier: ROC AUC on test dataset: 0.9751
RandomForestClassifier: f1 score on test dataset: 0.6258

Cross-validation is used to check whether the model is over-fitting:

Cross-validation results for Random Forest:
fit_time: 0.21, score_time: 0.01
accuracy: test 0.90 / train 0.99
precision: test 0.69 / train 0.97
recall: test 0.46 / train 0.72
f1: test 0.45 / train 0.83
roc_auc: test 0.90 / train 1.00

After tuning the model, the best F1 score obtained is 0.62. Cross-validation results suggest the model is over-fitting and might not be able to generalize well.

Different techniques were explored to see if the model would improve. The first attempt used the Synthetic Minority Oversampling Technique (SMOTE) from the imblearn library; the intent of this method is to over-sample the minority class by creating synthetic data. The results were worse than the original model. Additionally, an under-sampling method, RandomUnderSampler from the same imblearn library, was tested; in this case it reduces the sample of the majority class to balance the data. The results were not better than the original tuned model.

Additionally, to improve the F1 score, a change in the probability threshold was explored; instead of the default 0.5, an analysis was done to estimate the threshold that optimizes the F1 score.
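One common way to do this sweep, sketched here on illustrative labels and probabilities (not the project's actual model outputs), is to evaluate F1 at every candidate threshold from the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative ground truth and predicted failure probabilities
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)
y_prob = np.clip(y_true * 0.4 + rng.uniform(0, 0.6, 500), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# F1 at every candidate threshold (the last P/R pair has no threshold, so skip it)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(best)  # threshold that maximizes F1, instead of the default 0.5
```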

Figure 6. Values of Recall, Precision and F1 Score metrics for every threshold.

Figure 7. Confusion Matrix for results of Random Forest Classifier for default threshold 0.5.

Figure 8. Confusion Matrix for results of Random Forest Classifier for optimized threshold 0.28.

As observed in Figures 7 and 8, after optimizing the threshold a balance can be found between precision and recall. Depending on the use case, this optimization can favor one or the other; in some cases the model might need to be tuned in a specific direction to reduce over-rejection or under-rejection.

A second approach is to use XGBoost; this model has the option to handle weights for each class. By adding a higher weight to the minority class, it is expected to handle the imbalance in the dataset better.

XGBoost Results:

F1 Score: 0.757
Cross-validation results for XGBoost:
fit_time: 0.07, score_time: 0.01
accuracy: test 0.98 / train 1.00
precision: test 0.74 / train 1.00
recall: test 0.72 / train 1.00
f1: test 0.73 / train 1.00
roc_auc: test 0.97 / train 1.00

The F1 score for XGBoost without significant tuning and using weights is better than the random forest model originally used. Cross-validation results are still showing some level of over-fitting but improved from the original model. Hyperparameter tuning was done with a similar approach as with random forest, but no significant improvement was observed.

Figure 9. Confusion Matrix for results of XGBoost Model.

The XGBoost results are better than the random forest model with the optimized threshold. Based on the cross-validation results, over-fitting also appears slightly lower for the XGBoost model.

Predicting the Failure Type

Since the XGBoost results were slightly better, this model will be used to go beyond the binary classification and try to predict the different failure modes.

Figure 10. Confusion Matrix results for the XGBoost multi-class model. Class names: ‘Heat Dissipation Failure’: 0, ‘No Failure’: 1, ‘Overstrain Failure’: 2, ‘Power Failure’: 3, ‘Random Failures’: 4, ‘Tool Wear Failure’: 5

For the multi-class classification, the model was trained using the same parameters as the initial model. The main difference is how the weights (sample_weight parameter) were estimated; since there is more than one class, an array was calculated containing a weight for each class. Based on research, a common way to calculate these weights is to count the occurrences of each class and assign weights inversely proportional to that frequency.
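This inverse-frequency weighting can be sketched with scikit-learn's built-in helper; the class counts below are illustrative, not the dataset's actual counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Illustrative multi-class labels: class 1 ("No Failure") dominates
y = np.array([1] * 900 + [0] * 40 + [2] * 30 + [3] * 20 + [4] * 6 + [5] * 4)

# 'balanced' gives each sample a weight inversely proportional to its
# class frequency: n_samples / (n_classes * count(class))
weights = compute_sample_weight("balanced", y)
print(weights[0], weights[-1])  # majority-class weight << rarest-class weight
```

The resulting array can then be passed as the sample_weight argument when fitting the model.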

As with binary classification, the model is very good at predicting the majority class (No Failure). Class 0 (Heat Dissipation Failure) has an F1 score of 0.95, Class 2 (Overstrain Failure) 0.76, and Class 3 (Power Failure) 0.75, similar to the binary model's results. However, the model is not able to predict classes 4 and 5 (Random Failures, Tool Wear Failure); their F1 scores are 0.

One additional important outcome of these models is to understand what features are really important. Although the model performance is not great for all classes, understanding what features are important in the prediction model can help subject matter experts interpret results and take action to reduce failures and improve the process overall.

Detailed Jupyter Notebook can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Model_Training_PDM.ipynb

Time Series Analysis

The second objective of this study, which is also very relevant in manufacturing, is how to use machine learning and data mining skills to detect outliers or anomalies in the process. Many of the data streams from equipment are time-based; they are collected or sampled at a certain frequency. Anomaly detection is very valuable for the industry, as having the ability to know when a machine or process is deviating from the ‘normal’ can help to stop and repair equipment quickly, reducing the impact due to downtimes or defects.

After pre-processing the data, the study is focused on understanding how algorithms like the seasonal_decompose and ARIMA (or auto_arima) can be used for anomaly detection. As initially observed, the data was randomly generated over a specific range; it does not follow a trend, and there is no “seasonality” in it.

Figure 11. Results from seasonal decomposition: observed (original data), trend (smoothed), seasonal (the pattern found for the defined period), and residual (the difference between the original data and the trend/seasonal components).

Assuming the decomposition is somehow accurate, we can use the residuals to define how each point is deviating from what was expected. Based on this assumption, a rule can be defined to identify what constitutes abnormal behavior.

import numpy as np

# Anomalies in residuals
residuals = decomposition.resid.dropna()   # residuals from the decomposition
threshold = 2 * residuals.std()            # rule: based on research, flag points beyond 2x the residual standard deviation
anomalies = np.abs(residuals) > threshold  # boolean mask of anomalous points

The second approach uses ARIMA (Autoregressive Integrated Moving Average). An ARIMA model is fitted using auto_arima from pmdarima to automate the selection of the (p, d, q) values. As with the other method, after fitting the model we calculate the difference between the actual and predicted values and apply a rule to that difference.

                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  139
Model:                        SARIMAX   Log Likelihood                -149.194
Date:                Mon, 18 Aug 2025   AIC                            302.388
Time:                        18:42:30   BIC                            308.257
Sample:                    01-01-2024   HQIC                           304.773
                         - 03-10-2024                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      5.8176      0.060     96.802      0.000       5.700       5.935
sigma2         0.5010      0.059      8.537      0.000       0.386       0.616
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 0.16
Prob(Q):                              0.98   Prob(JB):                         0.92
Heteroskedasticity (H):               1.37   Skew:                             0.07
Prob(H) (two-sided):                  0.28   Kurtosis:                         3.10
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Best (p, d, q): (0, 0, 0)

Table 6. SARIMAX Results Summary

Figure 12. Results from ARIMA model and anomaly points detected based on pre-defined threshold of 2X the standard deviation of the residuals (True = anomaly).

The predictions are almost a constant value around the center of the distribution, meaning the ARIMA model is essentially predicting the mean at every point of the time series. One reason could be that the data is not predictable; as originally stated, it came from a random generator and was then summarized. Why is this still a valid dataset? A real machine, as mentioned before, might be designed to run at a specific target, say consuming 6 kW. There is normal variation in processes and systems that adds noise, and this variability is in many cases not predictable. Models like ARIMA or seasonal decomposition might not be ideal in these situations, but they still provide some insight into what can be considered an outlier/anomaly. LSTM was also explored, with similar results.

Although this dataset might not be ideal for these models, as observed in Figure 12 the model is able to identify the points that deviate most from the “normal” range, which is what this analysis intended to explore. That said, there might be simpler methods to achieve this for this specific case.

Details of analysis can be found here: https://github.com/INFO-523-SU25/final-project-castro/blob/main/src/Time_Series_Analysis.ipynb

Conclusions

  • The study demonstrated the application of concepts in machine learning to common manufacturing problems.
  • The Random Forest Classifier model ROC-AUC scores are high, indicating the model can differentiate effectively between fails and no-fails. However, in a real manufacturing process, the majority of the results are positive/pass, making this indicator not the best for this case. Recall and precision are more appropriate; depending on the use case, we would want to tune the model in one or the other direction or use the F1-score to optimize both.
  • The Random Forest model performed poorly for the F1 score when using a standard threshold of 0.5. The study demonstrated this can be improved by selecting an optimized threshold based on the model results.
  • Sampling can be a useful method to handle imbalanced datasets; however, in this specific case, it did not provide a significant improvement in model performance.
  • Assigning a weight to each class to handle the imbalanced sample, in combination with a gradient boosting (XGBoost) model, produced a better F1 score.
  • Multi-class classification results using the learnings from the binary-classification were demonstrated. Due to the nature and frequency of the failures and their relationship with the input features, two classes had zero F1-scores, meaning the model was not able to predict them. Other classes had acceptable F1-scores.
  • The study demonstrated how seasonal-decomposition and ARIMA can be used for anomaly detection in a real manufacturing use case; results showed how data points deviating the most from the center/target can be identified by using these methods. These methods might not be the best for a process where there are no patterns and data might just have random variability from the target.

References

  1. Shivam Bansal. “Machine Predictive Maintenance Classification Dataset.” Kaggle. Available at: https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification.

  2. Ziya. “Intelligent Manufacturing Dataset.” Kaggle. Available at: https://www.kaggle.com/datasets/ziya07/intelligent-manufacturing-dataset/data.

  3. INFO-523 University of Arizona. “Comparing Classifiers.” GitHub Notebook. Available at: https://github.com/dataprofessor/code/blob/master/python/comparing-classifiers.ipynb.

  4. G. Lemaître, F. Nogueira, and C. K. Aridas. “SMOTE: Synthetic Minority Over-sampling Technique.” In: imbalanced-learn documentation. Available at: https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html.

  5. G. Lemaître, F. Nogueira, and C. K. Aridas. “RandomUnderSampler.” In: imbalanced-learn documentation. Available at: https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html.

  6. T. Chen and C. Guestrin. “XGBoost for Imbalanced Classification.” XGBoosting.com. Available at: https://xgboosting.com/xgboost-for-imbalanced-classification.

  7. S. Puranik. “Calinski–Harabasz Index for K-Means Clustering Evaluation.” Towards Data Science, August 2021. Available at: https://towardsdatascience.com/calinski-harabasz-index-for-k-means-clustering-evaluation-using-python-4fefeeb2988e/.

  8. Skipper Seabold and Josef Perktold. “seasonal_decompose — Seasonal Decomposition of Time Series.” statsmodels, accessed 2025. Available at: https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html.

  9. Jason Brownlee. “Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras.” Machine Learning Mastery, March 10, 2018. Available at: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/.

  10. Explanations, troubleshooting, grammar and clarifications were aided by ChatGPT (OpenAI, 2025). OpenAI. [Large language model]. https://chat.openai.com/